Abstract

Background: Recently, information derived from correlated mutations in proteins has regained relevance for 
predicting protein contacts. This is due to new forms of mutual information analysis that have been proven to be 
more suitable to highlight direct coupling between pairs of residues in protein structures and to the large number 
of protein chains that are currently available for statistical validation. It was previously discussed that disulfide bond 
topology in proteins is also constrained by correlated mutations. 

Results: In this paper we exploit information derived from a corrected mutual information analysis and from the 
inverse of the covariance matrix to address the problem of the prediction of the topology of disulfide bonds in 
Eukaryotes. Recently, we have shown that Support Vector Regression (SVR) can improve the prediction for the 
disulfide connectivity patterns. Here we show that including correlated mutation information increases the SVR 
performance by 5 percentage points (from 54% to 59%). When this approach is used in combination with a 
method previously developed by us and scoring at the state of the art in predicting both location and topology of 
disulfide bonds in Eukaryotes (DisLocate), the per-protein accuracy is 38%, 2 percentage points higher than that 
previously obtained. 

Conclusions: In this paper we show that the inclusion of information derived from correlated mutations can 
improve the performance of the state of the art methods for predicting disulfide connectivity patterns in 
Eukaryotic proteins. Our analysis also provides support to the notion that improving methods to extract 
evolutionary information from multiple sequence alignments greatly contributes to the scoring performance of 
predictors suited to detect relevant features from protein chains. 



Background 

Disulfide bonds are covalent cross-links between cysteine 
side chains that play very important roles in the native 
structures of globular proteins. Folding, stability, and ulti- 
mately function of secreted proteins in cells are influenced 
by the formation of disulfide bonds between cysteine resi- 
dues [1]. Predicting the topology and the location of disul- 
fide bridges in a protein from its sequence therefore plays 
a relevant role in protein structural and functional annotation. 
Several computational methods are presently available 
for computing cysteine properties in a protein 
sequence and they can be grouped into: i) methods that 
predict the disulfide bonding state [2-4]; ii) methods that 
predict the topological connectivity patterns by assuming 
that the cysteine bonding state is known [5-8]; iii) methods 
that compute both i) and ii)[9-12]. Recently we developed 
DisLocate, a two-stage method for disulfide bond predic- 
tion in Eukaryotes comprising two integrated modules. 
The first based on Conditional Random Fields (CRFs) pre- 
dicts the cysteine bonding state; the second based on a 
Support Vector Regression (SVR) predicts the topology of 
the disulfide bridges [12]. DisLocate improved over pre- 
vious methods by introducing for the first time the infor- 
mation of the protein subcellular localization in the 
prediction of the disulfide bonding state [12]. 

Here we address the problem of improving the second 
step of the prediction, namely the prediction of disulfide 
connectivity pattern, by exploiting the role of correlated 
mutations. Correlated mutation analysis aims at elucidat- 
ing relations between pairs of residues in the protein 
structure that may influence its folding. Routinely, this is 
done through the identification of the co-evolution of dif- 
ferent positions in a multiple sequence alignment. The 
notion of correlated mutation describes that an unfavour- 
able residue mutation in a structural contact can be com- 
pensated by the simultaneous change of the direct 
partner in such a way that the original interaction is pre- 
served (compensatory mutation) [13]. It has been 
recently observed that with sufficient and correct infor- 
mation about protein residue-residue contacts it is possi- 
ble to predict some protein structures from the residue 
chain [13-16]. 

Correlated mutation analysis was also introduced in 
the context of disulfide bond connectivity prediction. 
Simple correlation patterns of concerted appearing and 
disappearing cysteines in multiple structural alignments 
were used to predict the topology of disulfide bonds in 
proteins [17]. 

In the present paper we propose the usage of informa- 
tion derived from correlated mutations to improve the 
prediction of disulfide connectivity over a set of proteins 
including 1797 chains (PDBCYS). We evaluate two differ- 
ent approaches of computing the correlated mutations: 
corrected mutual information (MIp) and sparse inverse 
covariance estimation (iCOV). MIp is a corrected version 
of mutual information specifically designed to remove the 
background noise due to both phylogenetic and entropic 
biases [18]. The latter approach (iCOV) which is based on 
sparse inverse covariance estimation was recently intro- 
duced for the problem of predicting contact maps [20]. 
Here we combine information derived with both methods 
for computing correlated mutations with features that 
were previously found relevant for predicting the disulfide 
connectivity and implemented in our DisLocate [12]. In 
order to highlight the effect of correlated mutations we 
benchmark the newly developed predictors on the same 
dataset (PDBCYS) previously adopted to evaluate DisLo- 
cate [12]. Our results show that correlated mutation analy- 
sis adds to the previously introduced features and 
improves the prediction scores. This indicates that corre- 
lated mutations are a significant piece of information also 
when computing the connectivity pattern of disulfide 
bridges in protein structures. 



Methods 

Mutual information among cysteines 

Mutual Information (MI) can be used to provide a mea- 
sure of the co-evolution of two positions in a protein 
sequence. In protein structures the measures of co-evolu- 
tion and MI in particular have been extensively applied 
for predicting residue contacts in proteins [13-16]. Here 
we focus only on sequence positions that contain cysteine 
residues. We then compute a multiple sequence align- 
ment for each protein of interest and we extract the posi- 
tions that correspond to cysteines in the query 
sequences. By this, we end up with sub-alignments that 
contain as many columns as the number of cysteines that 
are present in the query sequence. The Mutual Information 
MI between cysteines i and j is then computed as 
follows: 



$$\mathrm{MI}(i,j) = \sum_{a,b} f_{ij}(a,b)\,\log \frac{f_{ij}(a,b)}{f_i(a)\,f_j(b)} \qquad (1)$$



where $f_i(a)$ and $f_j(b)$ are the relative frequencies of 
amino acid types a and b at positions i and j, respectively, 
and $f_{ij}(a,b)$ is the relative frequency of the amino 
acid pair ab at positions i and j. 
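
As an illustration of the definition above, the following is a minimal sketch that computes MI for two columns of a sub-alignment. The function name `mutual_information`, the toy columns and the use of the natural logarithm are assumptions made for this example; the text does not state the base of the logarithm.

```python
import math
from collections import Counter

def mutual_information(col_i, col_j):
    """MI between two alignment columns given as equal-length strings."""
    n = len(col_i)
    f_i = Counter(col_i)                 # counts of residue a at position i
    f_j = Counter(col_j)                 # counts of residue b at position j
    f_ij = Counter(zip(col_i, col_j))    # counts of the pair (a, b)
    mi = 0.0
    for (a, b), n_ab in f_ij.items():
        p_ab = n_ab / n
        # natural logarithm: an assumption of this sketch
        mi += p_ab * math.log(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return mi

# Toy columns (hypothetical data, four aligned sequences)
covarying = mutual_information("CCSS", "CCSS")    # the two positions change together
independent = mutual_information("CCSS", "CSCS")  # the two positions vary independently
```

In this toy case the perfectly co-varying columns give log 2, while the independent columns give zero.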

The MI metric suffers from several drawbacks mainly due 
to entropic effects and phylogenetic biases that reduce its 
efficacy in predicting residue contacts [18]. The entropic 
bias occurs when a given position in the multiple align- 
ment exhibits a high variability (entropy). These positions 
tend to have higher level of MI than those with a lesser 
entropy [18]. The phylogenetic bias is due to the phyloge- 
netic relationships between organisms represented in the 
alignment that may generate an uneven distribution of 
sequence residues [18]. In order to overcome these issues, 
it has been proposed to correct the MI values as computed 
in Equation 1 by the so called average product correction 
(APC) [18]. APC measures the background signal of MI 
due to entropic and phylogenetic biases. This corrected 
metric is called MIp. The MIp for positions i and j is then 
obtained as follows: 



$$\mathrm{MIp}(i,j) = \mathrm{MI}(i,j) - \frac{\mathrm{MI}(i,-)\,\mathrm{MI}(-,j)}{\overline{\mathrm{MI}}} \qquad (2)$$



where $\mathrm{MI}(i,-)$ is the average mutual information 
between position i and all other positions (analogously 
$\mathrm{MI}(-,j)$ for position j) and $\overline{\mathrm{MI}}$ is the average mutual 
information of all positions. 
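
The average product correction can be sketched as follows for a symmetric matrix of pairwise scores. This is only an illustration: the function name `apc_correct` and the toy matrix are hypothetical, and the sketch assumes that the diagonal is excluded from all averages and that the overall mean is taken over the off-diagonal pairs.

```python
def apc_correct(scores):
    """Average product correction of a symmetric m x m score matrix (list of lists)."""
    m = len(scores)
    # mean score between position i and all other positions
    row_mean = [sum(scores[i][k] for k in range(m) if k != i) / (m - 1)
                for i in range(m)]
    # overall mean over all off-diagonal pairs (assumption of this sketch)
    overall = sum(scores[i][j] for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    corrected = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i != j:
                corrected[i][j] = scores[i][j] - row_mean[i] * row_mean[j] / overall
    return corrected

# Toy MI matrix for four cysteines (hypothetical values)
mi = [[0.0, 0.8, 0.2, 0.2],
      [0.8, 0.0, 0.2, 0.2],
      [0.2, 0.2, 0.0, 0.8],
      [0.2, 0.2, 0.8, 0.0]]
mip = apc_correct(mi)
```

Here every position has the same mean score (0.4), so the correction subtracts the same background from each pair: the strongly coupled pairs keep a positive score and the weakly coupled ones become negative.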

Sparse inverse covariance estimation 

In recent works it has been pointed out that it is possi- 
ble to improve the co-evolutionary information using 
the inverse of the covariance matrix [19,20]. In particu- 
lar, using information stored into the inverse of the 
covariance matrix, the performance of contact prediction 
improves significantly with respect to the simple 
MI or MIp by reducing the so-called indirect coupling 
effect, i.e. the statistical dependency observed in multiple 
sequence alignments for residues that are structurally 
distant [19-21]. One of the proposed approaches (followed 
here) is based on sparse inverse covariance estimation 
and is called PSICOV [20]. 

In this paper, we apply an approach similar to PSICOV 
to estimate the level of direct coupling between cysteine 
residues in proteins. As a result, given a protein with m 
cysteines, we obtained an m by m matrix whose elements 
can be interpreted as disulfide bonding scores. As 
described for MI in the previous section, we consider mul- 
tiple sequence alignment constrained to positions corre- 
sponding only to cysteines in the query sequence. For a 
protein sequence with m cysteines, the sample covariance 
matrix can be then computed as follows: 



$$S_{ij}^{ab} = f_{ij}(a,b) - f_i(a)\,f_j(b) \qquad (3)$$



where $f_i(a)$, $f_j(b)$ and $f_{ij}(a,b)$ are defined as in the previous 
section and S is a 21m by 21m covariance matrix 
(here we also include the gap as a 21st symbol). 
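
A minimal sketch of how such a 21m by 21m matrix could be assembled from a cysteine sub-alignment is given below. It assumes the PSICOV-style definition in which the entry for residue a at position i and residue b at position j is the pair frequency minus the product of the two single-site frequencies. The function name `sample_covariance`, the alphabet ordering and the toy alignment are hypothetical.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"   # 20 amino acids plus the gap as 21st symbol

def sample_covariance(rows):
    """rows: aligned sequences restricted to the m cysteine columns (equal-length strings)."""
    n, m, q = len(rows), len(rows[0]), len(ALPHABET)
    idx = {a: k for k, a in enumerate(ALPHABET)}
    f1 = Counter()   # counts of symbol a at column i
    f2 = Counter()   # counts of symbol a at column i together with symbol b at column j
    for row in rows:
        codes = [idx[c] for c in row]
        for i, a in enumerate(codes):
            f1[i, a] += 1
            for j, b in enumerate(codes):
                f2[i, a, j, b] += 1
    S = [[0.0] * (q * m) for _ in range(q * m)]
    for i in range(m):
        for a in range(q):
            for j in range(m):
                for b in range(q):
                    # pair frequency minus product of single-site frequencies
                    S[i * q + a][j * q + b] = f2[i, a, j, b] / n - (f1[i, a] / n) * (f1[j, b] / n)
    return S

# Toy sub-alignment with two cysteine columns (hypothetical data)
S = sample_covariance(["CC", "CC", "SS", "SS"])
```

Most symbols are never observed at a given column, so many rows of S are identically zero, which is one way to see why the matrix is typically singular.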

Assuming that the covariance matrix can be inverted 
(the matrix is not singular), the inverse matrix provides 
information about the degree of direct coupling between 
different positions in the protein sequence [19,20]. Unfor- 
tunately, the covariance matrix can be singular (since we 
do not observe every amino acid in a given position of the 
alignment). In order to estimate the inverse matrix, 
the PSICOV authors proposed to use sparse inverse covariance estimation 
by means of the graphical Lasso optimization procedure 
[22,23]. This procedure attempts to estimate the 
inverse covariance matrix by solving the following optimi- 
zation problem: 



$$\sum_{i,j} S_{ij}\,\Theta_{ij} - \log\det\Theta + \rho \sum_{i,j} \left|\Theta_{ij}\right| \qquad (4)$$



where S is a d×d covariance matrix, $\Theta$ is the inverse covariance 
matrix and the last term is a regularization term 
(the $\ell_1$-norm of the inverse matrix) that favors the sparsity 
of the solutions. $\rho$ is a hyper-parameter that governs the 
level of desired sparsity (the greater $\rho$ is, the sparser the 
solution). The disulfide bonding score between cysteines i 
and j of the protein is computed as follows: 



$$C(i,j) = \sum_{a,b} \left|\Theta_{ij}^{ab}\right| \qquad (5)$$



where the summation over a and b is taken by exclud- 
ing gaps. Finally, the same average product correction is 
used here to adjust the value for background noise as 
described for MIp: 



$$C_p(i,j) = C(i,j) - \frac{C(i,-)\,C(-,j)}{\overline{C}} \qquad (6)$$



where $C(i,-)$ is the mean contact score between position 
i and all other positions (analogously $C(-,j)$ for position 
j) and $\overline{C}$ is the overall mean contact score. We refer 
to this bonding score as iCOV in the rest of the paper. 

Predicting disulfide connectivity patterns 

Once the cysteine bonding state is assigned, we predict the 
connectivity pattern of the subsets of proteins that contain 
at least a pair of cysteines in the bonding state by applying 
a Support Vector Regression approach [12]. The SVR predictions 
for each possible pair of cysteines are used as edge 
weights and the Edmonds-Gabow algorithm is adopted to 
predict the most probable disulfide pattern [5]. In order to 
evaluate SVR, we use the same 20-fold cross validation 
procedure described before [12], considering only proteins 
with at least two disulfide bridges. SVRs were trained 
using an input encoding based on global and local infor- 
mation. The global information (that does not depend on 
each particular cysteine pair) is defined by the Normalized 
Protein Length (one real value), the Protein Molecular 
Weight (one real value) and the protein amino acid com- 
position (20 real values). The local pairwise encoding (that 
depends on each particular cysteine pair) consists of the 
following descriptors: 

• two PSSM-based windows centered on the 
cysteines forming the pair. We used a window of 
length 13, the best performing among the different-sized 
windows we tested. With this choice, we ended 
up with a vector of 13 * 20 * 2 = 520 components; 

• the Relative Order of the Cysteines. This feature is 
encoded with 2 real values that represent the normalized 
relative order of a cysteine pair. Given a protein 
with n cysteines (C1,C2,...,Cn), the corresponding normalized 
ordered list of cysteines is given by (1/n, 
2/n,...,n/n). For each pair of cysteines, the corresponding 
values are then taken from the list (e.g. the pair 
(C1,C4) is encoded as (1/n,4/n)); 

• the Cysteine Separation Distance. This feature is 
encoded with 1 real value that represents the log-cysteine 
sequence separation computed as SEP(Ci, 
Cj) = log (|j - i|) where i and j are the sequence positions 
of cysteines Ci and Cj, respectively; 

• Correlated mutation information, based on MIp 
and/or iCOV. 
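
The two simplest pairwise descriptors in the list above (relative order and separation distance) can be sketched as follows. The function name `pair_features` and the example positions are hypothetical, and the natural logarithm is an assumption since the base is not specified in the text.

```python
import math

def pair_features(cys_positions, a, b):
    """Relative-order and separation features for the a-th and b-th cysteines.

    cys_positions: sorted sequence positions of the cysteines;
    a, b: 0-based indices into that list.
    """
    n = len(cys_positions)
    # normalized relative order, e.g. (C1, C4) -> (1/n, 4/n)
    rel_a, rel_b = (a + 1) / n, (b + 1) / n
    # log sequence separation (natural log: an assumption of this sketch)
    separation = math.log(abs(cys_positions[b] - cys_positions[a]))
    return [rel_a, rel_b, separation]

# Hypothetical protein with cysteines at positions 5, 12, 30 and 41
features = pair_features([5, 12, 30, 41], 0, 3)
```

For the pair (C1,C4) of this four-cysteine example the encoding is (1/4, 4/4) plus the logarithm of the 36-residue separation.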



Dataset description 

In this study we used the dataset PDBCYS introduced 
before [12]. From PDB (release May 2010) we extracted 
1797 Eukaryotic protein structures with resolution <2.5 
Å with at least two cysteine residues and global pairwise 
sequence similarity <25%. PDBCYS includes 7619 free 
and 3194 bonded cysteines. Since PDBCYS contains 
pairs of proteins with detectable local sequence similar- 
ity, we clustered all the chains using a local sequence 
similarity score. First, we ran a BLAST sequence search 
using all the proteins of the set versus themselves. Then, 
for each pair of proteins we selected the higher bidirectional 
(say p1 vs p2 or p2 vs p1) sequence identity as 
reported in the BLAST output. We subsequently treated 
the proteins as nodes of a graph and assigned an edge 
between two nodes only where local sequence identity 
between the corresponding protein sequences was > 
25%. In addition, we computed the connected compo- 
nents of the graph and treated each group of nodes as a 
protein cluster. Finally, the clusters were grouped in 20 
disjoint sets used to train and test the method. We used 
these 20 subsets to evaluate our method and to compare 
its performance with previous approaches by adopting a 
20-fold cross-validation procedure. 
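
The clustering step described above amounts to finding the connected components of a similarity graph. The following sketch uses a small union-find structure for that purpose; the function name `cluster_by_identity`, the input format and the identity values are hypothetical, and the assignment of clusters to the 20 cross-validation sets is not shown because the text does not specify it.

```python
def cluster_by_identity(names, identities, threshold=25.0):
    """Group proteins into connected components of the identity graph.

    identities: {(p, q): percent identity}, the higher of the two BLAST directions.
    """
    parent = {p: p for p in names}

    def find(p):
        # follow parent links to the cluster representative, compressing the path
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for (p, q), ident in identities.items():
        if ident > threshold:            # edge only when local identity > 25%
            parent[find(p)] = find(q)

    clusters = {}
    for p in names:
        clusters.setdefault(find(p), set()).add(p)
    return list(clusters.values())

# Hypothetical identities between four chains
clusters = cluster_by_identity(
    ["p1", "p2", "p3", "p4"],
    {("p1", "p2"): 40.0, ("p2", "p3"): 31.0, ("p3", "p4"): 12.0},
)
```

Note that p1 and p3 end up in the same cluster through p2 even though no direct edge links them, which is what keeps locally similar chains from being split between training and test sets.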

Performance measures 

In the following, $N_c$ is the number of correctly predicted 
bonds, $N_p$ is the total number of predicted bonds, $N_b$ is 
the total number of observed bonds, $N_{patt}$ is the number of 
correctly predicted disulfide connectivity patterns and $N$ 
is the total number of chains. 

To score the disulfide connectivity prediction we com- 
puted the following indices: 



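• the precision $P_b$: 

$$P_b = \frac{N_c}{N_p} \qquad (7)$$

• the recall $R_b$: 

$$R_b = \frac{N_c}{N_b} \qquad (8)$$

• the accuracy per protein $Q_p$: 

$$Q_p = \frac{N_{patt}}{N} \qquad (9)$$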



For the sake of readability, in the Tables we report the 
indices as percentages (i.e. the obtained values are multiplied 
by 100). 
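
The three indices can be computed as in the sketch below, where each chain is represented by the set of its bonds. The function name `score`, the helper `bond` and the two example chains are hypothetical.

```python
def bond(i, j):
    """An unordered cysteine pair."""
    return frozenset((i, j))

def score(predicted, observed):
    """predicted, observed: one set of bonds per chain, in the same chain order."""
    n_c = sum(len(p & o) for p, o in zip(predicted, observed))    # correctly predicted bonds
    n_p = sum(len(p) for p in predicted)                          # predicted bonds
    n_b = sum(len(o) for o in observed)                           # observed bonds
    n_patt = sum(1 for p, o in zip(predicted, observed) if p == o)  # fully correct patterns
    n = len(observed)                                             # number of chains
    return {"Pb": 100 * n_c / n_p, "Rb": 100 * n_c / n_b, "Qp": 100 * n_patt / n}

# Two hypothetical chains with two disulfide bonds each
observed = [{bond(1, 2), bond(3, 4)}, {bond(1, 4), bond(2, 3)}]
predicted = [{bond(1, 2), bond(3, 4)}, {bond(1, 2), bond(3, 4)}]
result = score(predicted, observed)
```

In this example the first chain is predicted perfectly and the second is entirely wrong, so all three indices are 50%.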

Technical details 

All multiple sequence alignments used to compute both 
the MIp and the iCOV features have been generated by 
running 3 iterations of the jackhmmer program which is 
a part of the HMMER 3.0 package (http://hmmer.org) 
against the UNIREF90 sequence database. The inverse 
covariance estimation was performed by means of the 
glasso R package available at the CRAN archive 
(http://cran.rproject.org/web/packages/glasso/index.html), the 
same used in [12]. All the estimations have been performed 
using the exact algorithm of the glasso code (see the 
glasso package documentation for details). The glasso algorithm 
depends on a parameter $\rho$ that controls the 
sparsity of the reconstructed inverse covariance matrix. 
This parameter also affects the algorithm run time: the 
smaller $\rho$ is, the longer the required time. Below we 
report the results obtained when $\rho$ is set to 1e-8, which 
was chosen as a trade-off between the computational time 
and the method performance (computed on the validation 
sets). MIp values were computed as described in 
[18]. For the SVR implementation we used the libsvm 
package (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) with 
a RBF kernel. 

Results and discussion 

Prediction of disulfide connectivity with known 
bonding state 

In order to evaluate the effect of correlated mutations in 
the task of predicting the topology of disulfide bonds, we 
first assume that disulfide bonded cysteines are known. 
We evaluated the performance of methods considering 
subsets of proteins with a different number of disulfide 
bonds (from 2 to 5). The reported accuracy was obtained 
using the same 20-fold cross validation procedure pre- 
viously described [12]. 

In Table 1 the results obtained by evaluating only the 
correlated mutation information are listed. Both MIp and 
iCOV are evaluated as unsupervised predictors. This was 
done by considering the correlated mutation values com- 
puted with the MIp and iCOV algorithms as a measure of 
the extent of the interaction between pair of cysteines 
without applying any supervised learning procedure. We 
constructed two simple predictors by directly interpreting 
MIp and iCOV as disulfide bonding potentials among 
cysteines and predicting the highest scoring set of cysteine 
pairs as the most probable disulfide connectivity pattern. 
The pattern selection was done by computing the maximum-weight 
perfect matching with the Edmonds-Gabow 
algorithm as previously described [5]. The performance of 
these unsupervised predictors reported in Table 1 (43.7-44.5% 
of Qp and 49.9-51.7% of Pb) is significantly 
higher than that of a random predictor and higher than that of methods 
that do not include evolutionary information [5]. Differently 
from the case of contact prediction [20], in our case 
MIp routinely outperformed iCOV with the exception of 
the case of three disulfide bonds, where iCOV obtained 
the highest score. 
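
The pattern selection step can be illustrated with the sketch below. It deliberately replaces the Edmonds-Gabow algorithm with an exhaustive enumeration of all perfect matchings, which returns the same maximum-weight pattern but is practical only for the small numbers of cysteines considered here. The function names and the toy score matrix are hypothetical, and an even number of bonded cysteines is assumed.

```python
def perfect_matchings(items):
    """Yield every perfect matching of an even-sized list as a list of pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for tail in perfect_matchings(remaining):
            yield [(first, partner)] + tail

def best_pattern(scores):
    """Connectivity pattern with the highest total pairwise score (brute force)."""
    cysteines = list(range(len(scores)))
    return max(perfect_matchings(cysteines),
               key=lambda pattern: sum(scores[i][j] for i, j in pattern))

# Toy bonding scores for four bonded cysteines (hypothetical values)
scores = [[0.0, 0.4, -0.2, -0.2],
          [0.4, 0.0, -0.2, -0.2],
          [-0.2, -0.2, 0.0, 0.4],
          [-0.2, -0.2, 0.4, 0.0]]
best = best_pattern(scores)

# Number of candidate patterns for three disulfide bonds (six cysteines)
n_patterns_3_bonds = sum(1 for _ in perfect_matchings(list(range(6))))
```

The enumeration also shows how quickly the search space grows: six bonded cysteines already admit 15 alternative patterns, so a uniformly random guess would be fully correct in about 7% of cases.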

In Table 2 we report the performance of the SVR-based 
predictors that include in their input the correlated muta- 
tion information. For sake of comparison, we also report 
Table 1 Performance on disulfide connectivity prediction obtained with correlated mutation measures 

# bonds   iCOV              MIp               Random
          Pb = Rb   Qp      Pb = Rb   Qp      Pb = Rb   Qp
2         62        62      68        68      33        33
3         52.6      42.4    47.8      37.7    20        7
4         51.8      26.8    49.4      29.3    14        1
5         39.5      16.2    33.5      13.5    11        0.1
All       51.7      43.7    49.9      44.5    23        15

# bonds: number of disulfide bonds; iCOV: sparse inverse COVariance 
estimation; MIp: corrected Mutual Information; Random: performance 
obtained by a random predictor. Here Pb = Rb since the total number of 
predicted bonds (which is known in this experiment) is equal to the total 
number of observed bonds ($N_p = N_b$). For index definition see Performance 
measures. 

the accuracy per protein of the best method based on SVR 
that does not take advantage of the correlated mutation 
information but exploits all the other input features 
described in the Method section (Table 2, under the col- 
umn labelled SVR). SVR is equivalent to the second mod- 
ule of DisLocate [12]. From the data listed in Table 2, it is 
evident that when the correlated mutation information is 
included in the SVR input, the overall performance 
increases (compare SVR with SVR+iCOV and SVR+MIp). 
In both cases they outperform the baseline SVR predictor, 
in particular for proteins that contain 4 and 5 disulfide 
bonds (it was previously discussed that the difficulty of the 
prediction increases as the number of cysteines increases 
[5]). It is also worth noticing that iCOV seems to add 
more information than MIp, as indicated by the 
relative scoring values (columns SVR+iCOV and SVR+MIp). 
Since iCOV and MIp appear to capture different 
aspects of the correlation between cysteines, we also evaluated 
an SVR-based predictor that includes among its features 
both correlation measures (SVR+iCOV+MIp). In 
this case the performance of the SVR further increases 
and exceeds by 5 percentage points that of the recently published 
SVR method (second step of DisLocate in [12]). 

Prediction of disulfide connectivity with predicted 
bonding state 

In real cases, when new protein sequences are analysed, 
it is not known whether any of the cysteines form 
disulfide bonds in the three-dimensional structure of the 
protein. It is therefore useful to evaluate the predictor accuracy 
starting from an unlabelled sequence and predicting 
both the disulfide bonding states and the 
connectivity pattern. We evaluated the performance of 
the connectivity pattern predictor based on SVR+iCOV 
+MIp when the bonding state of cysteines is not known 
but it is predicted. For this purpose, we adopted the 
bonding state predictor previously introduced in DisLo- 
cate which is based on Grammatical-Restrained Hidden 
Conditional Random Fields [24] and protein subcellular 
localization [12]. In this case, we used the GRHCRF part 
of DisLocate to predict the bonding state and the new 
predictor to assign the connectivity pattern. For sake of 
comparison, we evaluate the method adopting the same 
experimental setup previously described and using the 
same cross-validation procedure [12]. Results are shown 
in Table 3 and indicate that the improvement over Dis- 
Locate is 2 percentage points of accuracy per protein. 

Prediction performance as a function of the quality of 
the multiple sequence alignments 

MIp and iCOV are computed over multiple sequence 
alignments. We therefore evaluate how the number and 
the type of sequences included in the alignment (used 
to compute the correlation among cysteine residues) 
can affect the final result. 

The number of aligned sequences in each multiple 
alignment can vary from sequence to sequence. We evaluate 
the dependence of the method performance on the 
number of sequences by computing Qp at increasing 
threshold values of the number of proteins included in the 
multiple sequence alignment. The results are reported in 
Figure 1, where it is evident that the method has on average 
a lower performance on proteins whose corresponding 
multiple sequence alignments contain <5000 
sequences. Conversely, when the number of aligned 
sequences is larger than 10000, the method on average 
reaches its best scores. However, a large number of aligned 
sequences may not be sufficient if the observed sequence 
variation does not add any information. In order to highlight 
this effect, we evaluated the method performance as 
a function of the number of effective sequences in the 


Table 2 Performance on disulfide connectivity prediction obtained with different SVR-based methods 

# bonds   SVR               SVR+iCOV          SVR+MIp           SVR+MIp+iCOV
          Pb = Rb   Qp      Pb = Rb   Qp      Pb = Rb   Qp      Pb = Rb   Qp
2         75        75      76        76      73        73      76        76
3         60        48      62.8      55.3    59.6      50.6    62.8      55.3
4         57        44      67.1      51.2    61        46.3    67.7      51.2
5         46        19      55.1      27      54.1      29.7    58.9      32.4
All       60        54      65.2      58.6    61.9      55.5    66.2      59.3



# bonds: number of disulfide bonds; MIp: corrected Mutual Information; iCOV: sparse inverse COVariance estimation; SVR: Support Vector Regression; and their 
combinations as indicated. For details see Methods. Results are evaluated on the PDBCYS dataset [12]. SVR results are taken from [12]. For index definition see 
Performance measures. 
Table 3 Prediction without a prior knowledge of the cysteine bonding state 

# bonds   DisLocate               SVR+MIp+iCOV
          Rb      Pb      Qp      Rb      Pb      Qp
1         83      46      76      93      46      76
2         67      52      61      71      59      62
3         47      41      35      55      49      38
4         52      37      35      63      48      38
5         39      39      15      50      49      16
All       52      42      36      60      50      38



Legends are as in Table 2. 



alignment (NEFF score) [25]. NEFF is calculated as the 
exponential of the entropy value averaged over all columns 
of the multiple alignment: in this respect NEFF is 
also interpreted as the entropy of a sequence profile 
derived from the multiple alignment [25]. NEFF is a real 
value ranging from 1 to 20. Multiple sequence alignments 
consisting of very similar sequences (or singletons) have a 
NEFF value close to 1, while random (uniform) alignments 
generate a NEFF of 20. Figure 2 shows that also 
for the problem at hand, the larger the NEFF value, the 
higher the method performance, with the maximum 
reached at NEFF = 10 (in our dataset the maximum value is 
11). These findings are in agreement with the notion that 
the more representative the multiple sequence alignment is, 
both in terms of sequence abundance and diversity, 
the higher the expected predictive performance of 
the method [19,20]. 
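
The NEFF definition given above (exponential of the column entropy averaged over the alignment) can be sketched as follows. The function name `neff` and the toy alignments are hypothetical. For simplicity this sketch counts every character, including gaps, as an ordinary symbol and applies no sequence weighting, which may differ from the procedure of [25].

```python
import math
from collections import Counter

def neff(alignment):
    """exp of the Shannon entropy averaged over all alignment columns."""
    n = len(alignment)
    entropies = []
    for column in zip(*alignment):
        counts = Counter(column)
        # Shannon entropy of the column (natural log, so that exp() inverts it)
        entropies.append(-sum((c / n) * math.log(c / n) for c in counts.values()))
    return math.exp(sum(entropies) / len(entropies))

# Toy alignments (hypothetical data)
identical = neff(["ACD", "ACD"])   # no variation in any column
diverse = neff(["AC", "CA"])       # two equally frequent residues per column
```

Identical sequences give the minimum value of 1, while columns with two equally frequent residues give 2, in line with the interpretation of NEFF as an effective number of residue types per column.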



Conclusions 

The prediction of protein structures from their sequences 
is still an open problem in Structural Bioinformatics, 
especially considering that the disproportion between the 
number of putative protein sequences and the 
number of known 3D structures is increasing exponentially. 
The bonding state of cysteines plays a relevant role in 
stabilizing the tertiary folds of proteins, in defining protein 
functions and in triggering functionally relevant conformational 
changes [26]. The knowledge of disulfide bonds is 
very important to predict the protein structure in ab initio 
and comparative modelling since it poses constraints on 
the possible chain conformations [27,28]. In this paper we 
introduce a new method to predict disulfide bonds starting 
from protein sequence. We investigate the effect of the 
information derived from correlated mutations on the problem 
of predicting the topology of disulfide bonds in proteins. 
We show that correlated mutations in the form of 
corrected mutual information (MIp) and inverse of the covariance 
matrix (iCOV) carry a significant quantity of information 
that was not completely exploited before for the 
task of disulfide bond prediction. We present a new 
method that, by including information derived from correlated 
mutations, improves the performance over the state 
of the art method DisLocate [12]. Finally, we highlight that 
the optimal performance of the method can be achieved 
when the number of sequences included in the multiple 
alignment from which information on correlated mutations 
is derived is in the range of 10000 protein chains and the 






[Figure 1: plot not reproduced. x-axis: Number of aligned sequences. Series: MI, iCOV, SVRMI, SVRiCOV, SVRMIiCOV.] 

Figure 1 Scoring the method at increasing number of sequences in the MSA. The accuracy per protein (Qp) of the different methods is 
plotted as a function of the number of protein chains in the multiple sequence alignment (MSA quality) used to derive information on 
correlated mutations. MIp: corrected Mutual Information; iCOV: sparse inverse COVariance estimation; SVR: Support Vector Regression; and their 
combinations as indicated. For details see Methods. 
corresponding NEFF value of the alignment is greater than or 
equal to 10. 